A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performance by extending the idea of Transformers so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps can be identified: the limited comprehension of commonsense, factual, temporal and other everyday knowledge aspects calls into question the extensibility of VL tasks. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing missing information, unlocking novel capabilities of VL models. At the same time, knowledge graphs enhance the explainability, fairness and validity of decision making, issues of utmost importance for such complex implementations. The current survey aims to unify the fields of VL representation learning and knowledge graphs, providing a taxonomy and analysis of knowledge-enhanced VL models.
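The knowledge-injection idea surveyed above can be illustrated with a minimal sketch: retrieve facts about an entity mentioned in a caption from a knowledge graph and prepend them to the text side of a VL model. ConceptNet's public REST API is real; pairing it with caption enrichment in this way is a simplified assumption for illustration, not a specific method from the survey.

```python
import requests

def conceptnet_facts(concept: str, limit: int = 3) -> list[str]:
    """Fetch human-readable facts about `concept` from the public ConceptNet API."""
    url = f"http://api.conceptnet.io/c/en/{concept}"
    edges = requests.get(url, params={"limit": limit}).json().get("edges", [])
    # `surfaceText` is a readable rendering of a triple; it can be missing or None.
    return [e["surfaceText"] for e in edges if e.get("surfaceText")]

def knowledge_enhanced_caption(caption: str, entity: str) -> str:
    """Prepend retrieved commonsense facts to a caption before VL encoding."""
    facts = " ".join(f.replace("[[", "").replace("]]", "") for f in conceptnet_facts(entity))
    return f"{facts} {caption}" if facts else caption

print(knowledge_enhanced_caption("A photo of a violin on stage.", "violin"))
```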
Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation
Visual Word Sense Disambiguation (VWSD) is a novel, challenging task whose goal is to retrieve, from a set of candidate images, the one that best represents the meaning of an ambiguous word within a given context. In this paper, we take a substantial step towards unveiling this interesting task by applying a varied set of approaches. Since VWSD is primarily a text-image retrieval task, we explore the latest transformer-based methods for multimodal retrieval. Additionally, we utilize Large Language Models (LLMs) as knowledge bases to enrich the given phrases and resolve the ambiguity surrounding the target word. We also study VWSD as a unimodal problem by converting it to text-to-text and image-to-image retrieval, as well as to question answering (QA), to fully explore the capabilities of the relevant models. To tap into the implicit knowledge of LLMs, we experiment with Chain-of-Thought (CoT) prompting to guide explainable answer generation. On top of all this, we train a learn-to-rank (LTR) model to combine our different modules, achieving competitive ranking results. Extensive experiments on VWSD yield valuable insights to effectively drive future directions.

Comment: Conference on Empirical Methods in Natural Language Processing
(EMNLP) 2023
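Since the paper frames VWSD primarily as text-image retrieval, the baseline can be sketched with a pretrained CLIP model scoring each candidate image against the full phrase. The checkpoint choice and the absence of any LLM-based phrase enrichment are simplifying assumptions here; the paper's actual pipeline combines several such modules.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the paper experiments with several retrieval models.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_candidates(phrase: str, image_paths: list[str]) -> list[tuple[str, float]]:
    """Score each candidate image against the disambiguating phrase, best first."""
    images = [Image.open(p) for p in image_paths]
    inputs = processor(text=[phrase], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text has shape (1, num_images): phrase-to-image similarities.
    scores = out.logits_per_text[0].tolist()
    return sorted(zip(image_paths, scores), key=lambda kv: kv[1], reverse=True)

# Usage (hypothetical candidate files): rank_candidates("andromeda tree", ["cand0.jpg", "cand1.jpg", "cand2.jpg"])
```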
Towards explainable evaluation of language models on the semantic similarity of visual concepts
Recent breakthroughs in NLP research, such as the advent of Transformer models, have indisputably contributed to major advancements in several tasks. However, few works investigate the robustness and explainability of their evaluation strategies. In this work, we examine the behavior of high-performing pre-trained language models, focusing on the task of semantic similarity for visual vocabularies. First, we address the need for explainable evaluation metrics, which are necessary for understanding the conceptual quality of retrieved instances. Our proposed metrics provide valuable insights at both the local and the global level, exposing the shortcomings of widely used approaches. Second, adversarial interventions on salient query semantics expose the vulnerabilities of opaque metrics and highlight patterns in the learned linguistic representations.
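The adversarial-intervention idea can be sketched as follows: embed a query and a visual concept with a pre-trained language model, then perturb a salient query word and watch the similarity score shift. The sentence-transformers encoder and the single word swap below are illustrative assumptions, not the paper's proposed metrics.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative encoder; the paper examines several high-performing language models.
model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity(a: str, b: str) -> float:
    """Cosine similarity between the embeddings of two phrases."""
    emb = model.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

query, concept = "a small domestic cat", "kitten"
base = similarity(query, concept)
# Adversarial intervention: swap the salient head noun and re-score.
perturbed = similarity("a small domestic dog", concept)
print(f"original: {base:.3f}  perturbed: {perturbed:.3f}  drop: {base - perturbed:.3f}")
```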